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Statistical  Traffic  Anomaly  Detection  in 
Time- Varying  Communication  Networks  • 

Jing  Wang  t  and  Ioannis  Ch.  Paschal idis,+  Fellow,  IEEE, 


Abstract — We  propose  two  methods  for  traffic  anomaly 
detection  in  communication  networks  where  properties 
of  normal  traffic  evolve  dynamically.  We  formulate  the 
anomaly  detection  problem  as  a  binary  composite  hypothesis 
testing  problem  and  develop  a  model-free  and  a  model-based 
method,  leveraging  techniques  from  the  theory  of  large 
deviations.  Both  methods  first  extract  a  family  of  Prob¬ 
ability  Laws  (PLs)  that  represent  normal  traffic  patterns 
during  different  time-periods,  and  then  detect  anomalies 
by  assessing  deviations  of  traffic  from  these  laws.  We 
establish  the  asymptotic  Newman-Pearson  optimality  of 
both  methods  and  develop  an  optimization-based  approach 
for  selecting  the  family  of  PLs  from  past  traffic  data.  We 
validate  our  methods  on  networks  with  two  representative 
time-varying  traffic  patterns  and  one  common  anomaly 
related  to  data  exfiltration.  Simulation  results  show  that 
our  methods  perform  better  than  their  vanilla  counterparts, 
which  assume  that  normal  traffic  is  stationary. 

Index  Terms — Statistical  anomaly  detection,  large  devi¬ 
ations  theory,  set  covering,  binary  composite  hypothesis 
testing,  cyber-security. 


I.  Introduction 

A  network  traffic  anomaly  is,  broadly  speaking,  an 
unusual  traffic  pattern  that  can  not  be  explained  by  the 
typical  variability  observed  in  communication  network 
traffic.  Traffic  anomalies  may  arise  either  due  to  oper¬ 
ational  malfunctions  (e.g.,  router  failures)  or  due  to  the 
presence  of  malicious  traffic  that  threatens  the  security  of 
the  network.  Automated  online  traffic  anomaly  detection 
has  received  a  lot  of  attention,  primarily  motivated  by 
security  considerations. 

Network  traffic  anomaly  detection  is  a  special  case  of 
more  broadly  defined  system  anomaly  detection  and  rele¬ 
vant  approaches  can  be  roughly  grouped  into  two  classes: 
signature-based  anomaly  detection ,  where  known  pat¬ 
terns  of  past  anomalies  are  used  to  identify  ongoing 
anomalies  [1,2],  and  change-based  anomaly  detection 
which  identifies  patterns  that  substantially  deviate  from 

*  Research  partially  supported  by  the  NSF  under  grants  CNS- 
1239021  and  IIS-1237022,  by  the  ARO  under  grants  W911NF-11-1- 
0227  and  W91  INF- 12- 1-0390,  and  by  the  ONR  under  grant  N00014- 
10-1-0952. 
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normal  patterns  of  operations  [3-6],  Signature-based 
methods  generate  few  false-alarms,  yet,  detection  rates 
can  be  quite  low,  e.g.,  below  70%  [7].  Furthermore, 
such  methods  cannot  detect  zero-day  attacks,  i.e.,  attacks 
not  previously  seen,  and  need  constant  (and  expensive) 
updating  to  keep  up  with  new  attack  signatures.  In 
contrast,  change-based  anomaly  detection  methods  are 
considered  to  be  more  economic  and  promising  since 
they  can  identify  novel  attacks.  In  this  work  we  focus  on 
change-based  anomaly  detection  methods,  in  particular 
methods  that  leverage  statistical  techniques. 

Statistical  anomaly  detection  consists  of  two  steps. 
The  first  step  is  to  characterize  “normal  behavior”  by 
analyzing  past  system  behavior.  The  second  step  is  to 
identify  time  instances  where  system  behavior  does  not 
appear  to  be  normal  by  monitoring  the  system  contin¬ 
uously.  For  anomaly  detection  in  communication  net¬ 
works,  [5]  presents  two  methods  to  characterize  normal 
behavior  and  to  assess  deviations  from  it  based  on  the 
theory  of  Large  Deviations  (LD)  [8].  Both  methods  con¬ 
sider  the  traffic,  which  is  viewed  as  a  sequence  of  flows, 
as  a  sample  path  of  an  underlying  stochastic  process 
and  “compare”  empirical  measures  of  current  network 
traffic  to  some  reference  network  traffic  model.  The  first 
method  -  called  model-free  -  assumes  that  traffic  consists 
of  an  independent  and  identically  distributed  (i.i.d.) 
sequence  of  flows,  while  the  second  method  -  called 
model-based  -  models  traffic  as  a  Markov  Modulated 
Process.  Both  methods  make  a  stationarity  assumption, 
postulating  that  the  statistical  properties  of  traffic  do  not 
change  over  time. 

However,  traffic  in  modern  communication  networks 
is  hardly  stationary  [9],  Internet  traffic  is  subject  to 
weekly  and  diurnal  variations  [10, 11],  Internet  traffic  is 
also  influenced  by  macroscopic  factors  such  as  important 
holidays  and  events  [12],  leading  to  spikes  in  the  rate 
of  flows  that  eventually  subside  after  some  period  of 
time.  Similar  phenomena  arise  in  local  area  networks  as 
well.  This  motivates  the  work  in  this  paper  that  aims  at 
developing  methods  which  can  accommodate  transient 
and  periodic  traffic  behavior. 

To  that  end,  we  present  two  methods  that  are  robust 
to  time-varying  traffic  behavior:  a  robust  model-free 
and  a  robust  model-based  method  that  can  be  seen  as 
generalizations  of  the  corresponding  methods  from  [5]. 
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Robustness,  here,  should  be  interpreted  in  the  same  vein 
as  in  [13],  that  is,  anomaly  detection  decisions  are  made 
from  a  composite  hypothesis  test  which  allows  for  ambi¬ 
guity  in  the  probabilistic  model  that  characterizes  normal 
behavior.  In  particular,  a  Probability  Law  (PL)  used  in 
[5]  to  characterize  normal  behavior  is  now  replaced  by 
a  family  of  PLs  where  each  member  of  this  family 
captures  different  “modes”  of  the  traffic  at  different  time 
intervals  (e.g.,  day,  night,  during  a  spike  in  traffic,  etc.). 
We  develop  new  robust  hypothesis  tests  in  this  setting 
and  establish  their  asymptotic  optimality  in  a  Neyman- 
Pearson  sense.  Second,  we  propose  a  two-stage  method 
to  estimate  a  family  of  PLs.  Our  two-stage  method 
transforms  a  hard  problem  (i.e.,  estimating  PLs  for  multi¬ 
dimensional  data)  into  two  well-studied  problems:  (z) 
estimating  one-dimensional  data  parameters,  and  (zz) 
formulating  the  problem  of  selecting  a  representative  set 
of  PLs  to  form  a  PL  family  as  a  set  covering  problem. 
As  we  will  see,  this  two-stage  method  is  suitable  for 
distributed  computation. 

The  remainder  of  the  paper  is  structured  as  follows. 
Sec.  II  formulates  system  anomaly  detection  as  a  bi¬ 
nary  composite  hypothesis  testing  problem  and  develops 
our  two  robust  methods.  Sec.  Ill  discusses  how  we 
parametrize  network  traffic  and  presents  our  approach 
for  estimating  PL  families.  Sec.  IV  validates  our  overall 
approach  in  several  realistic  settings  and  compares  the 
performance  of  our  robust  methods  to  their  non-robust 
counterparts.  Sec.  V  contains  some  concluding  remarks. 

Notation:  Throughout  the  paper  all  vectors  are  as¬ 
sumed  to  be  column  vectors.  We  use  lower  case  boldface 
letters  to  denote  vectors  and  for  economy  of  space  we 
write  x  =  (xi, . . . ,  xn)  for  the  column  vector  x.  We 
use  upper  case  boldface  letters  to  denote  matrices.  We 
use  script  letters  to  define  sets,  and  denote  by  |.4|  the 
cardinality  of  set  A. 

II.  Binary  composite  hypothesis  testing 

In  this  section,  we  present  a  hypothesis  testing  frame¬ 
work  for  making  anomaly  detection  decisions.  As  we 
mentioned  earlier,  the  crux  of  our  methodology  is  to 
model  network  traffic  as  a  stochastic  process.  Historical 
traffic  time-series  can  be  used  to  estimate  a  set  of 
parameters  for  this  stochastic  process,  giving  rise  to 
what  we  called  a  PL.  Then,  the  problem  of  detecting 
an  anomaly  in  real-time  is  equivalent  to  testing  whether 
current  observed  traffic  is  indeed  a  “likely”  sample  path 
of  the  stochastic  process  we  learned  from  past  history. 

The  general  problem  we  will  consider  is  testing 
whether  a  sequence  of  observations  Q  —  {g1,...,^"} 
is  a  sample  path  of  a  stochastic  process  A  (hypothesis 
Ha)-  The  stochastic  process  Sf  is  assumed  to  be  discrete¬ 
time,  thus,  a  sample  path  of  length  n  can  be  denoted 
by  Sf  =  {G1, . . . ,  G"}.  All  random  variables  G*  are 


discrete  and  their  sample  space  is  a  finite  alphabet  £  = 
{or,  (72, . . . ,  <T|£i},  where  |£|  denotes  the  cardinality  of 
£.  Realizations  of  G[ .... .  G"  will  be  denoted  by  (f 
and  also  take  values  in  £.  We  assume  that  the  joint 
probability  distribution  p$(G1, . . . ,  Gn)  is  parameterized 
by  some  parameter  9  £  Cl,  where  O  is  the  set  where  9 
takes  values.  The  parameter  9  is  considered  unknown  and 
{pe(G1, . . . ,  Gn)  :  V  9  £  fi}  can  be  viewed  as  a  family 
of  PLs  characterizing  Ho-  As  we  will  see  later,  9  will 
range  over  the  various  modes  of  the  traffic  at  different 
time  intervals  and  O  will  be  a  discrete  set. 

We  will  treat  the  problem  of  deciding  whether  a 
realization  Q  of  is  anomalous  as  the  binary  com¬ 
posite  hypothesis  testing  problem  between  Ho  and  the 
complement  of  Ho  denoted  by  Ha-  We  call  such  a  test 
composite  because  9  is  considered  unknown.  A  decision 
rule  S  is  a  set  such  that  Q  £  S  =  {G\Ho  is  rejected},  in¬ 
dicating  an  anomaly,  and  Q  f  S  =  {Q  \  Ha  is  accepted}, 
indicating  no  anomaly.  For  a  decision  rule  S ,  we  define 
a5  (0)  =  Pe\H0[S  €E  S']  to  be  the  false  alarm  rate,  and 
/ 3s  ( 9 )  =  Pg\yi0[Q  f.  S]  to  be  the  miss  detection  rate, 
where  Pg\-H0[-\  is  the  probability  evaluated  assuming  Ho 
is  true  and  Ps\-h0  [■]  is  the  probability  evaluated  assuming 
the  alternative  hypothesis  is  true. 

We  use  the  term  exponent  to  refer  to  the  quantity 
—  linin^cx;,  T  log  P[-]  for  some  probability  P[-],  if  the 
limit  exists.  If  the  exponent  is  d ,  then  the  probability  ap¬ 
proaches  zero  as  e~nd.  We  next  present  the  Generalized 
Neyman-Pearson  Criterion  for  decision  rules. 

Definition  1 

(Generalized  Neyman-Pearson  (GNP)  Criterion).  A  de¬ 
cision  rule  S  is  optimal  if  it  satisfies 

1  c 

lim  sup  —  log  or  (0)  <  —A,  (1) 

n — »oo  77, 

and  maximizes  —  lim.,,^00  sup  log^  ^  uniformly  for  all 
0  C  L>. 

Because  the  joint  distribution  pg(G1, . . . ,  Gn)  becomes 
complex  when  n  is  large,  we  focus  on  two  types  of 
simplification.  One  is  to  assume  all  random  variables  G‘ 
are  i.i.d.,  the  other  is  to  assume  the  stochastic  process  G 
is  a  Markov  chain. 

A.  A  model-free  method 

We  first  propose  a  method  we  call  model-free,  where 
the  random  variables  Gl  are  i.i.d..  Each  G?  takes  the 
value  Oj  with  probability  pF  (Gl  =  <7j),  j  =  1, . . . ,  |£|, 
which  is  parameterized  by  9  £  O.  We  refer  to  the 
vector  =  (pg  (G*  =  rxi), . . .  (Gl  =  (J|S|))  as 
the  model-free  Probability  Law  (PL)  corresponding  to  9. 
Then  the  family  of  model-free  PLs  VF  =  {p^  :  9  S  0} 
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characterizes  the  stochastic  process  Sf.  To  characterize 
the  observation  Q ,  let 

1  n 

£F(ai)  =  =  &j),  j  =  (2) 

11  i= 1 

where  l(-)  is  an  indicator  function.  Then,  an  estimate  for 
the  underlying  model-free  PL  based  on  the  observation 
Q  is  £®  =  {£p(<Jj)  :  j  =  1,  •  •  • ,  |£|},  which  is  called 
the  model-free  empirical  measure  of  Q. 

Suppose  p,  =  (/i(cri), . . . ,  At(cr|E|))  is  a  model-free  PL 
and  v  =  (z/(<r i), . . . ,  v(cr |s|))  is  a  model-free  empirical 
measure.  To  quantify  the  difference  between  //  and  u, 
we  define  the  model-free  divergence  between  them  as 

|S|  (  T 

Df(v\\h)  =  ^2  v(aj)  log  fTZf-  (3) 

The  anomaly  detection  test  is  given  in  the  following  def¬ 
inition.  Notice  that  the  minimization  over  9  is  selecting 
the  PL  with  the  minimal  exponent,  or  equivalently  the 
largest  probability,  hence,  the  most  likely  PL. 

Definition  2 

( Model-Free  Generalized  Hoeffding  Test).  The  model- 
free  generalized  Hoeffding  test  [14]  is  to  reject  Ho  when 
Q  is  in  the  set: 

^  =  inf  Df{£% ||pf)  >  A}, 
where  A  is  a  detection  threshold. 

We  call  infflgn  Dp{£^F ||pf )  the  generalized  model-free 
divergence  between  £®  and  VF .  A  similar  definition 
has  been  proposed  for  robust  localization  in  sensor 
networks  [15].  We  introduce  the  following  theorem. 

Theorem  II.  1  The  model-free  generalized  Hoeffding 
test  satisfies  the  GNP  criterion. 

Proof:  See  Appendix  A.  ■ 

We  remark  that  when  applying  this  detection  rule  in 
practice  we  substitute  //  and  v  in  (3)  with  fi  and  u  where 
vfjj)  =  max( v (aj ), e)  and  =  ma x(p(oj),e),\/j 

and  e  is  a  small  positive  constant  introduced  to  avoid 
underflow  and  division  by  zero. 

B.  A  model-based  method 

We  now  turn  to  the  model-based  method  where 
the  random  process  ^  =  {G1,...,Gn}  is  assumed 
to  be  a  Markov  chain.  Under  this  assumption,  and 
for  Q  =  {g1, . . . ,  g11}.  the  joint  distribution  of  If 
becomes  pg  (Sf  =  G)  =  pf  (g1)  Ui=i  Pe  ( 9i+1  I  5®)’ 
where  pB(-)  is  the  initial  distribution  and  pB  (•  |  •)  is 
the  transition  probability;  all  parametrized  by  9  G  O. 


Let  pF  (tr.jjtJj)  be  the  probability  of  seeing  two  con¬ 
secutive  states  (<Ji,<Tj).  We  refer  to  the  matrix  Pg  = 
{pg  (a,.  crj)}|^=1  as  the  model-based  PL  associated 
with  9  G  PI.  Then,  we  can  interpret  VB  =  {P B  :  9  G  0} 
as  a  family  of  model-based  PLs.  To  characterize  the 
observation  Q,  let  for  all  i,  j  =  1, . . . ,  |S| 

1  n 

=  ai'gl  =  ai)' 

n  1=2 

We  define  the  model-based  empirical  measure  of  Q  as  the 
matrix  £gB  =  {£B(oi,  crJ-)}|:^=1 .  The  transition  probabil¬ 
ity  from  cr,  to  <jj  is  simply  £%{aj\af)  =  — j|f (tTgi’<Tj)  . 

2^7  =  1  &B  \<7i  ’a3  ) 

I  Si 

Suppose  II  =  { 7r ( cji ,  <t,-)L  •_1  is  a  model-based  PL 
and  Q  =  {^(cr.j,  o'j)}  P)'=1  is  a  model-based  empirical 
measure.  Let  fc(oj  \ai)  and  q(oj  |cr,:)  be  the  corresponding 
transition  probabilities  from  ct,;  to  crj.  Then,  the  model- 
based  divergence  between  II  and  Q  is 

db(  q  ii  n)  =  q(ai'  aj)  lQg  q\CJj  r4  •  (5) 

Similar  to  the  model-free  case  we  define: 

Definition  3 

(Model-Based  Generalized  Hoeffding  Test).  The  model- 
based  generalized  Hoeffding  test  is  to  reject  Ho  when  Q 
is  in  the  set: 

S*B={g\  inf  DB{£%\\PB)>\}, 
where  A  is  a  detection  threshold. 

As  in  the  model-free  case,  we  will  be  calling 
infflen  DB(£%\\Pg)  the  generalized  model-based  di¬ 
vergence  between  £Lf,  and  VB .  GNP  optimality  can  be 
established  in  this  case,  too  (cf.  Def.  1). 

Theorem  II.2  The  model-based  generalized  Hoeffding 
test  satisfies  the  GNP  criterion. 

Proof:  See  Appendix  B.  ■ 

As  in  the  model-free  case,  when  applying  this  de¬ 
tection  rule  in  practice  we  substitute  q(<Ji,crj)  = 
max(g(cr.i,  <Tj),  e)  and  n(oi,Oj)  =  max(7r(<7j,  aj),  e)  in 
the  place  of  q(a.i,aj)  and  7r((j,;,ijj)  in  (5),  respectively, 
where  £  is  some  small  positive  constant  introduced  to 
avoid  underflow  and  division  by  zero. 

III.  Network  anomaly  detection 

In  this  section  we  present  our  anomaly  detection 
methods  whose  structure  is  outlined  in  Fig.  1.  First 
we  use  reference  Hows  to  estimate  a  family  of  PLs 
that  characterize  the  normal  operation  of  the  network. 
This  involves  selecting  a  large  candidate  set  of  PLs 
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Fig.  1.  Structure  of  the  algorithms. 

and  then  refining  it  through  optimization.  Then,  we 
test  windows  of  traffic  against  this  family  using  the 
generalized  Hoejfding  tests  in  Definition  2  or  Definition 
3. 


A.  Data  representation 

1 )  Network  traffic  representation  and  flow  aggrega¬ 
tion:  Till  now,  we  deliberately  formulated  problems  in  a 
general  way.  In  fact,  the  only  requirement  of  our  methods 
on  data  is  that  each  data  record  should  be  comprised  of 
a  time  stamp  and  some  additional  features.  As  a  result, 
our  methods  can  be  used  in  different  applications  by 
choosing  appropriate  features. 

In  this  paper,  however,  we  focus  on  host-based 
anomaly  detection,  a  specific  application  in  which  we 
monitor  the  incoming  and  outgoing  packets  of  a  server. 
We  assume  that  the  server  provides  only  one  service 
(e.g.,  web  server)  and  other  ports  are  either  closed  or 
outside  our  interests.  As  a  result,  we  only  monitor  traffic 
on  a  certain  port  (e.g.,  port  80  for  web  service).  For 
servers  with  multiple  ports  in  need  of  monitoring,  we 
can  run  our  method  on  each  port. 

The  features  we  propose  for  this  particular  application 
relate  to  a  flow  representation  slightly  different  from 
that  of  commercial  vendors  like  the  Cisco  NetFlow 
[16].  Hereafter,  we  will  use  the  terms  “flows”,  “traffic”, 
and  “data”  interchangeably.  Let  {s1, . . . ,  sm}  denote  the 
collection  of  all  packets  collected  on  the  monitored  port 
of  the  host.  Since  the  server  IP  is  always  fixed,  it  will 
be  ignored  and  we  will  only  track  the  user  IP  address, 
denoted  by  x*  for  packet  s\  The  size  of  s*  is  If  £  [0,  oo) 
in  bytes  and  the  start  time  of  transmission  is  t‘  £  [0,  oo) 
in  seconds.  Using  this  convention,  packet  s’  can  be 
represented  as  (x\  bl,  tls )  for  all  i  =  1, . . . ,  m. 

We  compile  a  sequence  of  packets  s1  = 
(x\6\fi),...,sm  =  (xm,6m,i™)  with  t\  <  •••  <  t™ 
into  a  flow  f  =  ( x,b,dt,t )  if  x  =  x1  =  •  •  •  =  xm  and 
t\  —  tl~l  <  5f  for  i  =  2, . . .  ,m  and  some  prescribed 
6f  £  (0,  oo).  Here,  the  flow  size  b  is  the  sum  of  the 
sizes  of  the  packets  that  comprise  the  flow.  The  flow 
duration  is  dt  =  t™  —  t\.  The  flow  transmission  time  t 
equals  the  start  time  of  the  first  packet  of  the  flow,  i.e.. 


t\.  In  this  way,  we  can  translate  the  large  collection 
of  packets  into  a  relatively  small  collection  of  flows 
T .  We  then  cluster  the  user  IP  addresses  into  different 
groups.  For  simplicity  of  notation,  we  only  consider 
IPv4  addresses.  If  x*  =  (x\,x\,  xl3,  x\)  £  {0, . . . ,  255}4 
and  xJ  =  (x{,  x32,  x3,  xJ4)  £  {0,...,255}4  are  two 
IPv4  addresses,  the  distance  between  them  is  defined 
as  d(x\xJ)  =  \x\  -  ar[|2563  +  \x\  -  x^|2562  + 
\x3  —  xj3\256  +  \x\  —  xJA\.  This  metric  can  be  easily 
extended  to  IPv6  addresses.  1  Suppose  X  is  the  set  of 
unique  IP  addresses  in  T.  We  apply  typical  A'-means 
clustering  on  X  with  the  distance  metric  defined  above. 
For  each  x  €  X,  we  thus  obtain  a  cluster  label  /c(x). 
Suppose  the  cluster  center  for  cluster  k  is  xfc;  then, 
the  distance  of  x  to  the  corresponding  cluster  center 
is  da(x)  =  d(x,xfc(x)).  The  cluster  label  fc(x)  and 
distance  to  cluster  center  da(x)  are  used  to  identify  a 
user  IP  address  x,  leading  to  our  final  representation  of 
a  flow  as: 

f  =  (k(x),da(x),b,dt,t).  (6) 

2)  Quantization:  For  each  f\  the  cluster  label  /.:(x') 
is  discrete  and  suppose  the  number  of  user  clusters  is  K. 
We  quantize  da(xl),  If,  and  d\  to  discrete  values.  After 
quantization,  each  tuple  (fc(x),  d0(x),  b,  dt)  corresponds 
to  a  symbol  (tuple  of  discrete  values)  in  a  composite 
alphabet  S.  We  will  denote  by  g*  the  symbol  corre¬ 
sponding  to  flow  f\  For  every  (discretized)  flow  g\  we 
will  refer  to  the  elements  in  g'  corresponding  to  /c(x'), 
da(x*),  bl,  and  dlt  as  features  1,  2,  3,  and  4,  respectively. 

3)  Window  Aggregation:  In  our  methods,  flows  in 
some  reference  set  T  are  further  aggregated  into  win¬ 
dows  based  on  their  flow  transmission  times.  A  window 
is  a  detection  unit  that  consists  of  flows  in  a  continuous 
time  range  that  are  to  be  processed  together.  Let  h  be 
the  interval  between  the  start  points  of  two  consecutive 
time  windows  and  ws  be  an  appropriate  window  size. 
Flow  f '  belongs  to  window  j  if  its  transmission  time  f' 
satisfies  t 1  +  (j  —  1  )h  <  tl  <  t1  +  ( j  —  1  )h  +  ws.  Tj 
will  denote  the  collection  of  flows  in  a  window  j  and 
we  will  use  G,  to  denote  the  discretized  version  of  Tr 


B.  Estimating  PLs  from  reference  traffic 

The  purpose  now  is  to  estimate  families  of  model-free 
and  model-based  PLs  {p^  :  9  £  Gl}  and  {P  f  :  9  £  O}, 
respectively,  from  a  discretized  collection  of  flows  Qref 
with  flow  transmission  times  t  =  {t1, . . . ,  fn}.  We  will 
partition  Qref  into  segments  and  denote  by  A f(j)  the 
indices  of  all  flows  in  segment  j.  Depending  on  the  time- 
varying  traffic  pattern,  the  partitioning  of  Gref  will  be 
done  in  a  way  such  that  flows  in  each  A 4(j)  can  be 

'Other  distance  metrics  can  also  be  used  for  clustering,  including 
ones  that  take  into  account  the  geographical  location  of  IP  addresses. 
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Fig.  2.  The  relationship  between  flow  transmission  time  t  and  indices 
of  flows  M  (Ji )  governed  by  PL  j  for  shifting  networks  (A)  and  periodic 
networks  (B). 

assumed  to  be  generated  from  the  same  PL.  Thus,  each 
set  of  flows  A4(j)  can  be  seen  as  a  window  from  which 
a  model-free  or  model-based  empirical  measure  can  be 
derived  using  either  (2)  or  (4);  this  empirical  measure  is 
the  PL  derived  from  M(j).  Motivated  by  two  represen¬ 
tative  types  of  time-varying  networks,  we  consider  two 
approaches  of  partitioning  Qref  into  segments. 

1)  Shifting  networks:  The  first  type  of  networks  we 
consider  will  be  called  shifting  networks,  loosely  speak¬ 
ing  these  are  networks  where  properties  of  normal  traffic 
slowly  “shift”  with  respect  to  time.  This  suggests  that  the 
flows  close  in  time  are  more  likely  to  be  governed  by 
the  same  PL.  We  divide  Qref  into  segments,  each  with 
a  duration  of  a  prescribed  value  td .  The  flows  in  each 
segment  are  used  to  estimate  a  single  PL  (Fig.  2A).  The 
flow  indices  in  segment  j  are 

M{j)  =  {i  :  t1  +  ( j  -  1  )td  <  t*  <  t1  +jtd},  (7) 

where  j  =  1, . . . ,  [(tra  —  t1)/td\.  td  characterizes  how 
quickly  we  expect  the  statistical  properties  of  the  traffic 
to  shift.  Larger  td  indicates  a  slower  shifting  of  traffic 
properties  in  the  network.  One  can  choose  a  variety  of 
td  s,  and  for  each  td  generate  the  corresponding  A 4(j) 
and  the  resulting  PLs. 

2)  Periodic  networks:  The  second  approach  is  mo¬ 
tivated  by  periodic  networks  where  properties  of  the 
normal  traffic  change  periodically.  In  these  networks, 
two  flows  can  be  governed  by  the  same  PL  either  if 
their  transmission  times  are  close  or  if  the  difference  of 
their  transmission  times  equals  the  period  (Fig.  2B).  Let 
td  characterize  shifts  within  the  period  and  let  tp  be  the 
period.  For  j  =  1, . . . ,  [tp/td J,  let 

M(j)  =  U  {i:  ktp  +  ( j  -  1  )td  <  f  <  ktp  +  jtd}  , 

fcG/Cj 

(8) 

where 

ICj  =  {k  :  ktp  +  ( j  —  l)td  >  t1  and  ktp  +  jtd  <  tn}  . 

Again,  we  can  choose  a  variety  of  tp’s  and  td’s  with 
each  combination  resulting  into  [tp/td\  PLs. 

Practical  networks  can  exhibit  both  types  of  non¬ 
stationary  behavior  described  above.  Moreover,  the  peri¬ 
odicity  and  the  degree  of  shift  may  change  over  time,  too. 


Fig.  3.  Histogram  of  intervals  between  two  consecutive  flows  with  a 
specific  feature  quantized  to  the  same  discrete  value. 

To  increase  the  robustness  of  the  PL  estimation  to  these 
non-stationarities,  we  first  propose  a  large  collection  of 
candidates  and  then  refine  it  using  integer  programming. 

C.  Generating  a  large  collection  of  PLs 

To  generate  a  collection  of  PLs  -  one  for  each  segment 
j\4  (j )  -  we  need  to  estimate  td  and  tp  described  in  the 
previous  section.  We  have  a  reference  sequence  of  quan¬ 
tized  flows  Qref  =  {g1, . . . .  g"  }  and  the  corresponding 
flow  transmission  times  t  =  {f1, . . . ,  fn}.  Recall  that 
features  of  flows  are  quantized  into  discrete  values.  For 
each  feature  a,  we  let  the  collection  of  all  corresponding 
discrete  values  be  E“  =  {er“,  ■  ■  ■ ,  cr^.a |}.  For  all  a,  b,  we 
say  a  flow  g1  belongs  to  channel  a-b  if  its  ath  feature  gla 
equals  ab.  We  first  analyze  each  channel  separately  to 
get  a  rough  estimate  of  td  and  tp.  Then,  channels  of  the 
same  feature  are  aggregated  to  get  a  combined  estimate. 

1)  Estimation  in  one  channel:  Let  Iab  = 
{fl  :  gla  =  ob }  be  the  sorted  sequence  of  flow 
transmission  times  for  flows  in  channel  a-b.  The 
interval  between  two  consecutive  flows  in  channel  a-b 
is  Tab  =  tab  -  tkaf\k  =  2, . . . ,  \lab\,  where  tkab  is  the 
fcth  element  in  lab- 

For  shifting  networks,  since  the  majority  of  flows  in 
each  channel  belong  to  a  continuous  time  range,  the 
intervals  between  two  consecutive  flows  are  small.  The 
histogram  of  the  intervals  {rkb  :  k  =  2, ,  \Tab\]  will 
have  a  heavy  head  that  contains  most  of  the  mass  within 
a  certain  time  close  to  zero  (Region  1  in  Fig.  3).  The 
end  of  the  time  interval  containing  most  of  the  mass  can 
be  used  as  an  upper  bound  on  the  interval  between  two 
consecutive  flows  and  is  a  good  option  for  td. 

For  periodic  networks,  the  histogram  for  intervals  in 
iTab  ■  k  =  2, . . . ,  \Xab\}  is  also  heavily  skewed  to  small 
intervals,  thus,  td  can  be  estimated  in  the  same  way  as 
with  shifting  networks.  However,  the  intervals  between 
two  consecutive  flows  can  be  large.  Fig.  4  shows  an 
example  of  a  feature  that  exhibits  periodicity.  There  will 
be  two  peaks  around  tp\  and  tp2  in  the  histogram  of  inter¬ 
vals  for  flows  whose  values  are  between  the  two  dashed 
lines.  We  can  select  tp  such  that  ( tp\  +  tpf)  /2  «  tp/2. 
In  practice,  there  can  be  multiple  peaks  because  of  the 
randomness  in  the  network,  and  we  can  choose  their 
mean  as  an  approximation  of  t.p/2  (cf.  Fig.  3). 
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Fig.  4.  Illustration  of  the  peaks  in  Region  2  of  Fig.  3. 

2)  Aggregation  for  channels  of  the  same  feature: 
Denote  the  estimate  of  td  and  tp  based  on  channel  a-b 
as  tdb  and  t“b,  respectively.  We  use  the  subscript  {d,p} 
to  unify  the  notations  for  both  estimates.  tpb  =  0  if  no 
periodicity  is  found  in  channel  a-b.  For  a  =  1,2,  3, 4,  let 
T&p}  =  |£a|,  and  tfdp}  >  0}  be 

the  collection  of  estimates  for  all  channels  of  feature  a. 
We  define  the  combined  estimate  of  /,/  and  tp  for  feature 
a  as  ta{d,P }  =  Mean(TKp})’  where  Mean(-)  calculates 
the  sample  mean  of  a  set. 

If  Tjj  is  empty,  the  network  is  non-periodic  according 
to  feature  a,  thus,  a  family  of  candidate  PLs  can  be 
generated  using  td  and  (7).  Otherwise,  the  network  is 
periodic  according  to  feature  a,  and  a  family  of  candidate 
PLs  can  be  generated  using  td,  tp,  and  (8).  In  addition, 
in  case  that  some  prior  knowledge  of  td  and  tp  is 
available,  the  family  of  candidate  PLs  can  include  the 
PLs  calculated  based  on  this  prior  knowledge. 

D.  PL  refinement  with  integer  programming 

From  Sec.  III-C,  we  now  have  a  large  family  of 
candidates  for  PLs.  The  larger  this  family  is,  the  more 
likely  it  is  to  overfit  Qref.  Furthermore,  a  smaller  family 
of  PLs  is  desirable  since  it  reduces  the  computational 
cost  of  anomaly  detection.  This  section  introduces  a 
method  to  refine  the  family  of  candidate  PLs. 

For  simplicity,  we  will  only  describe  the  procedure 
for  the  model-free  method.  The  procedure  for  the  model- 
based  method  is  similar.  To  make  the  exposition  concise, 
when  referring  to  the  divergence  between  the  empirical 
measure  of  a  set  of  flows  and  a  PL,  we  will  simply  say 
the  divergence  between  the  set  of  flows  and  the  PL. 

Suppose  the  family  (namely  the  set)  of  candidate 
PLs  is  V  =  {pf , . . . ,  pjy}  of  cardinality  N.  Because 
no  alarm  should  be  reported  for  Qref,  or  segments  of 
Gref,  our  primary  objective  is  to  choose  the  smallest 
set  VF  C  V  such  that  there  is  no  alarm  for  Gref- 
We  aggregate  Gref  into  M  windows  as  outlined  in 
Sec.  III-A  and  denote  the  data  in  window  i  as  G'ref- 

Let  Dij  =  ||  pF)  be  the  divergence  between 

flows  in  window  i  and  PL  j  for  i  =  1 , ,M  and 
j  =  1  We  say  window  i  is  covered  (namely, 

reported  as  normal)  by  PL  j  if  Dt.j  <  A.  Here,  A  is  the 
same  threshold  we  used  in  Def.  2.  With  this  definition. 


the  primary  objective  becomes  to  select  the  minimum 
number  of  PLs  to  cover  all  the  windows. 

There  may  be  more  than  one  subsets  of  V  having  the 
same  cardinality  and  covering  all  windows.  We  propose 
a  secondary  objective  characterizing  the  variation  of 
a  set  of  PLs.  Let  ffj  =  {i  :  D,:)  <  A}  be  the 

index  set  of  windows  covered  by  PL  j  and  denote  by 
Nj1\...,N^3^  the  ordered  elements  of  Nj.  Define 
V3  =  {Nf  -  :  i  =  2, . . . ,  \Nj\}  the  set  of 

differences  between  consecutive  window  indices  covered 
by  PL  j.  The  coefficient  of  variation  for  PL  j  is  defined 
as  4  =  Std(2?j) /Mean(Dj),  where  Std(Dj)  and 
Mean(2?j)  are  the  sample  standard  deviation  and  mean 
of  set  Vj,  respectively.  A  smaller  coefficient  of  variation 
means  that  the  PL  is  more  “regular.”  The  secondary 
objective  is  to  minimize  the  sum  of  coefficients  of 
variation  for  selected  PLs.  We  formulate  PL  selection 
as  a  weighted  set  cover  problem  in  which  the  weight 
of  PL  j  is  1  +  7 c{„  where  7  is  a  small  weight  for  the 
secondary  objective.  Let  x,  be  the  0-1  variable  indicating 
whether  PL  i  is  selected  and  let  x  =  (27, . . . ,  Xjy).  Let 
A  =  {atj}  be  an  M  x  N  matrix  whose  (i,j)th  element 
a.ij  is  set  to  1  if  <  A  and  to  0  otherwise.  Let 
cv  =  (Cy,  ■  ■  ■ ,  Cy).  Selecting  a  set  VF(C  V)  of  PLs 
can  be  formulated  as  an  integer  programming  problem: 

min  1  x  +  7C,„x 

s.t.  Ax  >  1,  (9) 

Xj  G  {0,1},  j  =  1,...,N, 

where  1  is  a  vector  of  ones.  The  objective  is  a  weighted 
sum  of  the  primary  cost  and  the  secondary  cost.  The 
first  constraint  enforces  there  is  no  alarm  for  Gref- 

function  HeuristicRefinePl(A,  cv,  r,  yth) 

Init:  bestCost  :=  00,  7  :=  1,  x*  :=  0 

while  7  >  7 th  do 

x  :=  GreedySolve(A,  7,  c„),  7  :=  r'y 

if  1  x  +  7t/lc„x  <  bestCost  then 
bestCost  :=  1  x  +  7t/lct,x 
x*  :=  x 

end  if 
end  while 
return  x* 
end  function 

function  GreedySetCover(A,  7,  cv) 

Init:  x°  :=  0,  C  :=  0 
while  |Cj  <  M  do 

j+  :=  arg  maxj:x[,]=0 

x[j+]  :=  1,  C:=CU{j:  Ojj+  =  1} 

end  while 
return  x 
end  function 

Algorithm  1:  Heuristic  algorithm  for  PL  refinement. 
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Because  (9)  is  NP-hard,  we  propose  a  heuristic 
algorithm  to  solve  it  (see  the  display  Algorithm  1). 
HeurisTICRefinePl  is  the  main  procedure  whose  pa¬ 
rameters  are  A,  cv,  a  discount  ratio  r  <  1,  and  a  termi¬ 
nation  threshold  yth-  In  each  iteration,  the  algorithm  de¬ 
creases  7  by  a  ratio  r  and  calls  the  GreedySetCover 
procedure  to  solve  (9).  The  algorithm  terminates  when 
7  <  7 th-  In  the  initial  iterations,  the  weight  7  for  the 
secondary  cost  is  large  so  that  the  algorithm  explores 
solutions  which  select  PLs  with  less  variation.  Later,  the 
weight  7  decreases  to  ensure  that  the  primary  objective 
plays  the  main  role.  Parameters  7 th  and  r  determine 
the  algorithm’s  degree  of  exploration,  which  helps  avoid 
local  minima.  In  practice,  we  can  choose  small  7 th  and 
large  r  if  we  have  enough  computation  power. 

GreedySetCover  uses  the  ratio  of  the  number  of 
uncovered  windows  a  PL  can  cover  and  the  cost  1  +  7C„ 
as  heuristics.  GreedySetCover  will  add  the  PL  with 
the  maximum  heuristic  value  to  VF  until  all  windows 
are  covered  by  the  PLs  in  VF .  Suppose  the  return  value 
of  HeurisTICRefinePl  is  x*.  Then,  the  refined  family 
of  PLs  is  VF  =  {pF  :  x*  >  0,  j  =  1, . . . ,  N}. 

Once  we  have  a  family  of  PLs,  then  anomaly  detection 
for  either  the  model-free  or  the  model-based  case  can 
be  done  by  using  either  the  test  of  Definition  2  or 
Definition  3  and  comparing  the  family  of  PLs  against 
the  empirical  measure  of  windows  of  current  activity 
(see  also  Fig.  1).  Thus,  the  proposed  method  processes 
windows  of  current  activity  one  after  the  other  and 
indicates  which  windows  are  found  anomalous. 

IV.  Simulation  results 

Lacking  data  with  annotated  anomalies  is  a  common 
problem  for  validation  of  anomaly  detection  methods.  To 
that  end,  we  developed  an  open  source  software  package 
SADIT  [17]  to  provide  flow-level  datasets  with  annotated 
anomalies.  Based  on  the /s- simulator  [18],  SADIT  simu¬ 
lates  normal  and  abnormal  flows  in  networks  efficiently. 

Our  simulated  network  consists  of  an  internal  (to 
an  organization)  network  and  several  Internet  nodes 
(Fig.  5).  The  internal  network  consists  of  8  normal 
nodes  CT1-CT8  and  1  server  ( SRV )  containing  some 
sensitive  information.  There  are  also  three  Internet  nodes 
INT1-INT3  that  access  the  internal  network  through  a 
gateway  (GATEWAY).  For  all  links,  the  link  capacity 
is  10  Mb/s  and  the  delay  is  0.01  s.  All  internal  and 
Internet  nodes  communicate  with  the  SRV  and  there  is  no 
communication  between  other  nodes.  The  normal  flows 
from  all  nodes  to  SRV  have  the  same  characteristics. 

We  consider  two  representative  types  of  changing 
patterns  for  normal  traffic:  shifting  pattern,  a  com¬ 
mon  pattern  for  traffic  to  “booming”  web  services, 
and  day-night  pattern,  a  common  pattern  for  traffic  to 
services  with  geographically  concentrated  users  (e.g.. 


www.boston.com).  For  the  shifting  pattern,  we  as¬ 
sume  the  size  of  the  normal  flows  follows  a  Gaus¬ 
sian  distribution  N(m(t),o2).  For  the  day-night  pat¬ 
tern,  we  consider  both  the  case  when  the  flow  size 
follows  a  Gaussian  distribution  N(m(t),  a2)  and  the  case 
when  the  flow  size  follows  a  log-normal  distribution 
In  N(m(t),  a).  For  all  cases,  the  arrival  process  of  flows 
is  assumed  to  be  a  Poisson  process  with  arrival  rate 
77(f).  Both  777(f)  and  77(f)  change  with  time  f.  For  both 
patterns,  we  monitor  the  traffic  on  the  server  and  evaluate 
the  performance  of  the  robust  model-free  and  model- 
based  methods  for  an  anomaly  associated  with  attacks 
in  which  some  hackers  exfiltrate  sensitive  information 
through  SQL  injection  [19]. 

A.  Shifting  pattern 

When  a  web  service  is  booming,  users  tend  to  generate 
and  download  more  content  from  its  servers.  From  the 
flow  perspective,  this  means  that  the  average  flow  size  is 
shifting  to  higher  values.  As  a  simple  model,  we  assume 
the  flow  size  of  all  users  follows  a  Gaussian  distribution 
N(m(t),o2)  truncated  to  be  positive,  where  777(f)  is  a 
linear  function  of  time  as  777(f)  =  at  +  b  and  77(f)  is  a 
constant,  a  and  b  are  two  parameters  that  characterize 
the  shift  of  the  traffic.  In  our  simulation,  we  set  b  =  4 
Mb,  a  =  3.6  Kb/h,  and  cr2  =  0.01  for  all  users.  The 
flow  arrival  rate  is  constant  and  77(f)  =  0.1  fps  (flows 
per  second).  Using  this  shifting  pattern ,  we  generate 
reference  traffic  Grej  for  one  week  (168  hours). 

For  window  aggregation,  both  the  window  size  ws 
and  the  interval  h  between  two  consecutive  windows  is 
2000  s.  The  number  of  user  clusters  is  K  =  2.  For  the 
quantization,  the  number  of  quantization  levels  for  the 
distance  to  cluster  center,  the  flow  size,  and  the  flow 
duration  (features  2,  3,  4)  are  2,  2,  and  8,  respectively. 

We  apply  the  procedures  described  in  Secs.  III-B  and 
III-C  to  Gref  to  identify  a  set  of  candidate  PLs.  Based  on 
flow  size,  the  traffic  is  non-periodic  and  the  combined 
estimate  of  td  is  3.89  h.  Our  PL  generation  approach 
leads  to  43  candidate  model-free  and  model-based  PLs, 
each  PL  being  calculated  using  a  segment  of  Qref. 

The  results  of  PL  refinement  for  the  model-free  case 
with  a  shifting  pattern  are  plotted  in  Fig.  6.  Subfigure  A 
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Fig.  6.  Results  of  PL  refinement  for  model-free  PLs  in  a  network 
with  shifting  pattern. 
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Fig.  7.  Results  of  PL  refinement  for  model-based  PLs  in  a 
network  with  shifting  pattern. 


plots  the  divergence  between  the  reference  traffic  and  all 
candidate  PLs,  Subfigure  B  plots  the  divergence  between 
the  reference  traffic  and  the  selected  PLs,  and  Subfigure 
C  depicts  the  “active  PL,”  whose  definition  will  be 
presented  later,  at  each  window.  The  a>axis  shows  the 
start  time  of  each  window.  Results  for  other  cases  will  be 
plotted  in  the  same  format.  For  the  model-free  method, 
because  m(t)  shifts  with  time,  the  PL  calculated  based 
on  the  flows  in  a  certain  window  has  small  divergence 
with  near-by  windows,  but  the  divergence  becomes  larger 
for  windows  further  away  (cf.  Fig.  6A).  There  are  4 
PLs  selected  by  the  PL  refinement  procedure  when  the 
detection  threshold  is  set  to  A  =  2  (Fig.  6B).  We  say  PL 
j*  is  active  during  window  i  if  its  divergence  with  traffic 
in  this  window  is  the  smallest  among  all  selected  PLs, 
namely  j*  =  argmim,  I)u .  Each  selected  PL  is  active 
for  a  continuous  range  of  time,  which  is  consistent  with 
the  fact  that  the  traffic  pattern  is  shifting  (Fig.  6C).  The 
active  PL  oscillates  between  the  former  active  and  a  new 
PL  before  it  switches  to  the  new  PL. 

For  the  model-based  method,  6  PLs  are  selected  by 
the  PL  refinement  procedure  when  A  =  2  (Fig.  7A,B). 
Each  PL  is  active  for  a  continuous  range  of  time  with 
similar  oscillations  as  in  the  model-free  method  during 
the  transition  between  two  active  PLs  (Fig.  1C). 

B.  Day-night  pattern 

The  traffic  of  local  web  services  usually  exhibits 
diurnal  variations  because  people  browse  websites  more 
frequently  during  the  day  than  during  the  night.  Fig.  8 
shows  the  normalized  average  traffic  to  American  social 
websites  over  a  day  [20], 

1 )  Flow  Size  with  Truncated  Gaussian  Distribution: 
For  this  case  study,  we  assume  the  flow  size  of  all  users 
follows  a  Gaussian  distribution  N (m(t) ,  a2)  truncated  to 
be  positive,  and  the  flows  arrive  according  to  a  Poisson 
process  with  rate  77(f).  Let  p(t)  be  the  function  shown  in 


Fig.  8,  and  assume  77(f)  =  Ap(t)  and  771(f)  =  M.pp(t ), 
where  A  and  Mv  are  the  peak  arrival  rate  and  the  peak 
mean  flow  size.  In  our  simulation,  we  set  Mp  =  4  Mb, 
er2  =  0.01,  and  A  =  0.1  fps  for  all  users.  Using  this 
day-night  pattern ,  we  generate  a  reference  traffic  trace 
Gref  for  one  week  (168  hours)  whose  start  time  is  5  pm. 
Again,  an  estimation  procedure  is  applied  to  estimate 
t,i  and  fp.  The  parameters  for  window  aggregation  and 
quantization  are  the  same  as  in  Sec.  IV-A. 

We  apply  the  procedures  described  in  Secs.  III-B  and 
III-C  to  Gref  to  identify  a  set  of  candidate  PLs.  The 
combined  estimate  of  the  period  based  on  flow  size 
is  24.56  h,  which  is  only  2.3%  away  from  the  real 
value  of  24  h.  For  the  model-free  method,  there  are 
64  candidate  model-free  PLs  proposed  in  the  estimation 
stage.  The  model-free  divergence  between  each  window 
and  each  candidate  PL  is  a  periodic  function  of  time, 
too.  Some  PLs  have  smaller  divergence  during  the  day 
while  others  have  smaller  divergence  during  the  night  (cf. 
Fig.  9A).  However,  no  PL  has  small  divergence  for  all 
windows.  3  PLs  out  of  the  64  candidates  are  selected 
when  the  detection  threshold  is  A  =  0.6  (cf.  Fig.  9B). 
The  3  selected  PLs  are  active  during  day,  night,  and  the 
transition  time  between  day  and  night,  respectively  (cf. 
Fig.  9C  for  the  active  PLs  of  all  windows). 

For  the  model-based  method,  there  are  64  candidate 
model-based  PLs,  too.  Similar  to  the  model-free  method, 
the  model-based  divergence  between  all  candidate  PLs 
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Fig.  8.  Traffic  pattern  of  social  network  websites. 
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and  flows  in  each  window  in  Gref  is  periodic  (Fig.  10A) 
and  there  is  no  PL  that  can  represent  all  the  reference 
data  Gref  -  2  PLs  are  selected  when  A  =  0.4  (Fig.  10B). 
One  PL  is  active  during  the  transition  time  and  the  other 
is  active  during  the  stationary  time ,  which  consists  of 
both  day  and  night  (Fig.  IOC). 

2)  Flow  Size  with  Log-normal  Distribution:  Many 
flows  correspond  to  file  transfers  and  [21]  argues  that  the 
sizes  of  transferred  files  in  the  Internet  follows  a  mixed 
model  in  which  the  body  of  the  file  size  distribution  can 
be  modeled  by  a  log-normal  distribution.  In  this  section, 
we  assume  the  flow  size  (in  bytes)  follows  a  log-normal 
distribution  In N(m(t),cr),  where  m(t)  =  Mpp(t)  is 
a  time-varying  parameter.  Again  p(t)  is  the  pattern 
function  depicted  in  Fig.  8.  We  let  Mp  =  7.881  (in 
bytes)  and  cr  =  1.339,  both  of  which  are  estimated  from 
real  network  traffic  [21],  We  assume  that  flows  arrive 
according  to  a  Poisson  process  with  rate  0.1  fps. 

For  the  model-free  method,  there  are  32  candidate 
model-free  PLs  proposed  in  the  estimation  stage  and 
only  3  PLs  are  selected  when  A  =  0.1  (Fig.  11).  One 
PL  represents  the  “high  activity”  pattern  between  10  pm 
and  1  am,  one  PL  represents  the  “low  activity”  pattern 
between  2  am  and  11  am,  and  one  PL  represents  the 
“moderate  activity”  pattern  for  the  remaining  time. 

For  the  model-based  method,  there  are  also  32  candi¬ 
date  model-based  PLs  and  only  3  PLs  are  selected  when 
A  =  0.1  (Fig.  12).  The  meaning  of  the  three  selected 
PLs  for  the  model-based  methods  is  similar  to  their 
counterparts  for  the  model-free  methods. 

The  results  show  that  the  PL  refinement  procedure  is 
effective  and  the  refined  family  of  PLs  is  meaningful. 
Each  PL  in  the  refined  family  corresponds  to  a  “pattern 
of  normal  behavior.”  This  information  is  useful  not  only 
for  anomaly  detection  but  also  for  understanding  the 
normal  traffic  in  time-varying  networks. 

C.  Comparison  with  vanilla  stochastic  methods 

For  both  types  of  normal  patterns  in  Secs.  IV-A  and 
IV-B,  we  compared  the  performance  of  our  robust  model- 
free  and  model-based  method  with  their  vanilla  coun¬ 
terparts  in  detecting  anomalies  ([5,22]).  In  the  vanilla 
methods,  all  reference  traffic  Gref  is  used  to  estimate  a 
single  PL.  We  used  all  methods  to  monitor  the  server 
SRV  for  one  week  (168  hours)  under  the  two  network 
traffic  patterns. 

We  considered  an  anomaly  in  which  node  CT 2  in¬ 
creases  the  mean  flow  size  by  30%  at  59  h  and  the 
increase  lasts  for  80  minutes  before  the  mean  returns 
to  its  normal  value.  This  type  of  anomaly  could  be 
associated  with  a  situation  when  attackers  try  to  exfiltrate 
sensitive  information  (e.g.,  user  accounts  and  passwords) 
through  SQL  injection  [19]. 


For  all  methods,  the  window  size  is  ws  =  2000  s  and 
the  interval  h  =  2000  s.  The  quantization  parameters 
are  equal  to  those  in  the  procedure  for  analyzing  the 
reference  traffic  Gref-  The  simulation  results  show  that 
the  robust  model-free  and  model-based  methods  perform 
better  than  their  vanilla  counterparts  for  both  types  of 
normal  traffic  patterns  (Fig.  13). 

For  the  case  when  normal  traffic  exhibits  a  shifting 
pattern ,  the  detection  threshold  A  equals  2.0  for  all 
methods.  The  vanilla  model-free  method  misses  the 
anomaly  when  the  normal  traffic  shows  a  shifting  pat¬ 
tern  (Fig.  13A).  Even  worse,  it  generates  false  alarms 
for  the  first  30  hours.  In  contrast,  the  robust  mode- 
free  method  detects  the  anomaly  successfully  without 
false  alarms  (Fig.  13B).  False  alarms  also  appear  in 
the  detection  results  of  the  vanilla  model-based  method, 
though  it  detects  the  anomaly  successfully  (Fig.  13C). 
Again,  the  robust  model-based  method  reports  no  false 
alarm  (Fig.  13D). 

Compared  with  the  shifting  pattern ,  the  day-night 
pattern  has  more  influence  on  the  results  from  the 
vanilla  methods.  The  second  row  of  Fig.  13  shows 
the  results  for  the  case  when  the  flow  size  follows  a 
Gaussian  distribution.  For  both  the  vanilla  and  the  robust 
model-free  methods,  the  detection  threshold  A  equals  0.6. 
The  vanilla  model -free  method  reports  all  night  traffic 
(between  3  am  to  11  am)  as  anomalies  (Fig.  13E). 
The  reason  is  that  the  night  traffic  is  lighter  than  the 
day  traffic,  so  the  PL  calculated  using  all  of  Gref  is 
dominated  by  the  day  pattern,  whereas  the  night  pattern 
is  underrepresented.  In  contrast,  because  both  the  day 
and  the  night  patterns  are  represented  in  the  refined 
family  of  PLs  (Fig.  9B),  the  robust  model-free  method 
is  not  influenced  by  the  fluctuation  of  normal  traffic  and 
successfully  detects  the  anomaly  (Fig.  13F). 

The  day-night  pattern  has  similar  effects  on  the  model- 
based  methods.  When  the  detection  threshold  A  equals 
0.4,  the  anomaly  is  barely  detectable  using  the  vanilla 
model-based  method  (Fig.  13G).  Similar  to  the  vanilla 
model-free  method,  the  divergence  is  higher  during  the 
transition  time  between  day  and  night  because  the  tran¬ 
sition  pattern  is  underrepresented  in  the  PL  calculated 
using  all  of  Gref  -  Again,  the  robust  model-based  method 
is  superior  because  both  the  transition  pattern  and  the 
stationary  pattern  are  well  represented  in  the  refined 
family  of  PLs  (Fig.  13H). 

The  third  row  of  Fig.  13  shows  the  results  for  the 
case  when  the  flow  size  follows  a  log-normal  distribution 
and  the  normal  traffic  has  a  day-night  pattern.  For  all 
methods  we  use  detection  threshold  A  =  0.1.  The 
anomaly  can  not  be  observed  in  both  the  vanilla  model- 
free  method  and  the  vanilla  model-based  method  due 
to  the  high  variance  of  the  normal  traffic  (Fig.  131  and 
Fig.  13K).  Again,  vanilla  methods  report  a  lot  of  false 
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Fig.  9.  Results  of  PL  refinement  for  the  model-free  method  in 
a  network  with  day-night  pattern  where  the  flow  size  follows  a 
Gaussian  distribution. 
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Fig.  10.  Results  of  PL  refinement  for  the  model-based  method 
in  a  network  with  day-night  pattern  where  the  flow  size  follows 
a  Gaussian  distribution. 


30 

d-25 
x  CD  20 
15 


sequence  no.  of  active  PLs 

"n  20  40  60  SO  100  120  140  160 


time  (h) 


Fig.  11.  Results  of  PL  refinement  for  the  model-free  method  in 
a  network  with  day-night  pattern  where  the  flow  size  follows  a 
log-normal  distribution. 


divergence  between  traffic  and  all  candidate  P Ls 


divergence  between  traffic  and  selected  PLs 

slif  ’ 
bIq!  n  n  n 


fin  ^ Tte-irO 


0.0 
30  F 


sequence  no.  of  selected  PLs 


:  illifiunijfiijfL/ijjfLjn 


10- 

5- 

oL 


20  40 


60 


80  100 

time  (h) 


120  140  160 


Fig.  12.  Results  of  PL  refinement  for  the  model-based  method 
in  a  network  with  day-night  pattern  where  the  flow  size  follows 
a  log-normal  distribution. 


alarms,  most  of  which  come  from  “moderate  activity” 
and  “low  activity”  patterns  which  are  underrepresented. 
In  contrast,  the  robust  methods  can  successfully  detect 
the  anomaly  (Fig.  13J  and  Fig.  13L). 

In  return  for  more  computations,  the  robust  methods 
can  significantly  reduce  false  alarms.  A  traffic  segment 
will  not  cause  an  alarm  as  long  as  it  can  be  explained 
by  any  one  of  the  PLs.  One  limitation  is  that  if  an 
attacker  emulates  the  day  pattern  during  the  night,  our 
methods  will  not  detect  it.  However,  this  may  be  hard 
to  accomplish  as  the  attacker  will  need  to  understand 
aggregate  traffic  statistics  only  seen  at  the  server. 

An  alternative  way  is  to  associate  a  PL  with  a  par¬ 
ticular  time-of-day  and  evaluate  traffic  using  that  PL 
during  those  times  (e.g.,  a  “day”  PL  and  a  “night”  PL). 
However,  this  approach  also  has  some  limitations.  First, 
it  is  not  general  enough  because  the  times  in  our  training 
data  do  not  have  a  clear  corresponding  relationship  with 
the  times  of  the  data  that  need  to  be  evaluated,  as  for 
example  in  the  shifting  case.  Second,  identifying  which 


PL  to  use  is  not  only  challenging  but  also  error-prone, 
and  such  errors  are  likely  to  be  the  bottleneck  of  overall 
performance.  In  contrast,  the  approach  we  presented  is 
much  more  robust  to  noise  and  is  likely  to  have  better 
performance  in  practice. 


V.  Conclusions 

The  statistical  properties  of  normal  traffic  are  time- 
varying  for  most  actual  communication  networks.  To 
address  limitations  of  earlier  methods  that  relied  on 
stationarity,  we  propose  a  robust  model-free  and  a  ro¬ 
bust  model-based  method,  which  can  generate  a  more 
complete  representation  of  the  normal  traffic  and  are 
robust  to  the  non-stationarity  in  networks.  Although  in 
this  paper  we  focus  on  host-based  anomaly  detection,  our 
methods  can  be  generalized  to  other  applications,  such 
as  monitoring  road  traffic  and  monitoring  data  collected 
in  manufacturing  plants. 
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Fig.  13.  Comparison  of  vanilla  and  robust  methods.  The  first  row  corresponds  to  the  case  of  shifting  pattern,  the  second  row  corresponds  to 
the  case  of  day-night  pattern  with  Gaussian  flow  size,  and  the  third  row  corresponds  to  the  case  of  day-night  pattern  with  log-normal  flow  size. 
The  first  and  second  columns  show  the  results  of  the  vanilla  and  robust  model-free  methods,  respectively.  The  third  and  fourth  columns  show 
the  results  of  the  vanilla  and  robust  model-based  methods,  respectively.  The  green  dashed  lines  indicate  the  detection  thresholds. 


Appendix  A 
Proof  of  Theorem  II.  1 

Proof:  Denote  by  Cn  =  {u  \  u  =  for  some  Q } 
the  set  of  all  possible  model-free  empirical  measures, 
i.e.,  types  (Def.  2.1.1  of  [8])  of  sequences  with  length 
n.  The  type  class  of  a  model-free  empirical  measure  v 
is  Tn(v )  =  {y  G  £n  |  £p  =  v'},  where  £”  denotes 
the  Cartesian  product  of  £  with  itself  n  times.  Note 
that  a  type  class  consists  of  all  permutations  of  a  given 
observation  sequence  y  in  this  set.  Suppose  P(iz|p^)  is 
the  probability  for  empirical  measure  u  under  some  PL 
Pif  (H0  is  correct).  According  to  Lemma  2.1.9  in  [8], 

(n  +  l)-|£|e-r*iMH|pf)  <  P(iz|pf )  <  e~nDp(u\lPe)  _ 

(10) 

For  all  9  G  Cl,  the  false  alarm  rate  of  the  model-free 
generalized  Hoeffding  test  is 

aSp(6)  =  Pe\Ha  [S  G  SF] 

E 

{HL(v)csj} 

<  e-nDF(v\\Pe) 

{H  T„(v)CS*} 

<  (n  +  l)|s|e"nA. 

The  first  inequality  comes  from  (10).  For  the  second 
inequality,  we  use  the  definition  of  Sp  and  the  fact  that 
Cn  <  (n+l)lsl  (cf.  Lemma  2.1.2  in  [8]) .  Furthermore, 

1  _« 

lim  sup—  log  ot F(9) 

n — kx>  77, 

<  lim  sup  -  log((n  +  l)lsle_nA)  = -A, 

n— kx)  77, 


which  proves  that  Sf  satisfies  (1).  Let  now  S  be  some 
other  decision  rule  satisfying  (1).  For  all  e  >  0  and  large 
enough  n,  A  log  as(9)  <  —A  —  e,  which  is  equivalent  to 

as{6)  <  e~n(x+e).  (11) 

In  addition,  we  have 

uS(9)  =  E  P(u\Pe) 

{y\ TnMcs} 

>  E  (n+l)-|E|e-"DF(l/Hp^. 

{v\Tn(y)<ZS} 

The  inequality  comes  from  (10).  If  n  is  large 
enough,  (n  +  >  e~ne.  Moreover,  (n  + 

1)-lsle_"£,J?(I/llp^)  >  0,  V  zz  G  Cn,  so 

as(9)  >  e-™(M"llp£')+0. 

Combined  with  (11),  we  obtain  e~n(DF(v^Po  )+e)  < 
e“"(A+e),  which  implies  Dp  (y  ||  p^)  >  A  for  all  v 
such  that  Tn(zz)  C  S  and  9  G  12.  Consequently,  S  C  S*F 
and  j3s  (6)  >  /3s F  (9)  for  all  9  G  fl,  so  the  model-free 
generalized  Hoeffding  test  satisfies  the  GNP  criterion.  ■ 

Appendix  B 
Proof  of  Theorem  II. 2 

Proof:  Let  Cn  =  {Q  :  Q  =  £%,  for  some  Q  G 
£n}  be  the  set  of  possible  model-based  empirical  mea¬ 
sures,  i.e.,  types  of  the  sample  paths  of  Markov  chains 
with  length  n.  Note  that  every  component  of  Q  G  Cn 
belongs  to  the  set  whose  cardinality  is 

n  +  1.  Q  is  specified  by  at  most  |£|2  such  quantities, 
which  means  \Cn\  <  (n  +  .  Suppose  P(Q|P^)  is 
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the  probability  for  observation  Q  when  under  PL  Pj/ 
(Ho  is  correct).  According  to  Lemma  3  in  [23], 

(n  +  l)-|£|2-|£|e-nDB(Q||P«)  <  p(Q|pf ) 

<  e-nDB(Q\\Ps) _  (12) 


Similar  to  the  model-free  case,  define  the  type  class  of  a 
model-based  empirical  measure  Q  to  be  Tn( Q)  =  {y  £ 
E"  |  =  Q}.  For  the  false  alarm  rate  of  the  model- 

based  generalized  Hoeffding  test  we  have 


as*B  (9) 


< 


< 


Pe[G  €  S*B] 

E  P(Qlpf) 

{Q|T„(Q)CS]J 


E 


e-nDB(Q||Pf) 


{Q|T„(Q)C5^} 

(n+l)|s|V"A 


where  the  first  inequality  comes  from  (12),  and  the 
second  one  is  because  \Cn\  is  bounded  by  (n  +  l)lsl  . 
Then, 

1  o* 

lim  sup  —  log  a  B  (9) 

n— ►  oo  77, 

<  lim  sup  —  log((n  +  l)^2 e~nX)  =  —  A. 

n— >oo  77 

So  the  model-based  generalized  Hoeffding  test  satisfies 
(1).  Let  now  S  be  some  other  decision  rule  which 
satisfies  (1),  for  all  e  >  0  and  large  enough  n.  Then, 

as(0)  <  e-"(A+e).  (13) 

Also, 

as(9)  =  E  P(Q|Pe  ) 

{Q|T„(Q)CS} 

>  (n  + 1)-(|S|2  +  |S|)e-nDB(Q||Pf)_ 

{Q|T„(Q)CS} 

The  inequality  comes  from  (12).  If  n  is  large  enough, 
(n  +  +lsl)  >  e_ne,  which  implies 

as(9)  >  e-n(£,B(Q!|p^)+e) 

for  all  Q  such  that  Tn{ Q)  £  S  and  for  9  £  PL.  Combin¬ 
ing  the  above  with  (13),  we  obtain  e~n('DB^p^')+e^  < 
e— ”(A+e),  and  Db{ Q  ||  Pf)  >  A  for  all  Q  s.t. 
Tn(Q)  £  S  and  9  £  PL.  Consequently,  S  C  SB 
and  (3s  (9)  >  (3s B  {9),  so  the  model-based  generalized 
Hoeffding  test  satisfies  the  GNP  criterion.  ■ 
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